Introduction


FROSTBITE EVOLUTION


Frostbite 2007 vs 2017 


2007                                      2017
» DICE next-gen engine                    » The EA engine
» Built from the ground up for            » Evolved and scaled up for
    > Xbox 360                                > Xbox One
    > PlayStation 3                           > PlayStation 4
    > Multi-core PCs                          > Multi-core PCs
    > DirectX 9 SM3 & Direct3D 10             > DirectX 12
» To be used in future DICE games         » Used in ~15 current and future EA games


http://www.frostbite.com/2007/04/frostbite-rendering-architecture-and-real-time-procedural-shading-texturing-techniques

More diverse games
Not just Battlefield engine 
RPG, Racing, Sports, Action 


Rendering system overview '07


[Diagram: Game Renderer; World Renderer; Terrain; Particles; Undergrowth; Decals; Meshes; Shading system; Direct3D / libGCM]


http://www.frostbite.com/2007/04/frostbite-rendering-architecture-and-real-time-procedural-shading-texturing-techniques (slide 17)


Rendering system overview '17


[Diagram: Game Renderer; World Renderer; Post-processing; Volumetric FX; Terrain; Particles; Undergrowth; Game-specific rendering; Reflections; Meshes; Shading system; Shadows; PBR; Direct3D 11 / Direct3D 12 / libGNM (Metal / GLES / Mantle)]


Not meant to be a complete / representative graph, just illustrating the scaling 
challenges. 


Basically the same, except larger number of systems with more complicated coupling. 


Mostly unchanged since 2007 
Until recently! 
Everything scaled up 
More features 
Much larger community 
Scaling and maintenance challenges 


Shading system described in detail by Johan in 2007. 


Rest of this talk will be about World Renderer and rendering features. 


Rendering system overview (simplified) 


[Diagram: World Renderer; Features; Shading System; Render Context; GFX APIs]


WorldRenderer 


» Orchestrates all rendering
» Code-driven architecture
    > Main world geometry (via Shading System)
    > Lighting, Post-processing (via Render Context)
» Knows about all views and render passes
» Marshalls settings and resources between systems
» Allocates resources (render targets, buffers)


World Renderer architecture is the focus of the presentation from this point. 


Battlefield 4 rendering passes (Features)


reflectionCapture
planarReflections
dynamicEnvmap
mainZPass
mainGBuffer
mainGBufferSimple
mainGBufferDecal
mainGBufferFixup
msaaZDown
msaaClass
lensFlareOcclusionQueries
lightPassBegin
Shadowmaps
downsampleZ
linearizeZ
ssao
hbaoHalfZ
hbao
ssr
halfResZPass
halfResTransp
mainDistort
lightPassEnd
mainOpaque
linearizeZ
mainTransDecal
fgOpaqueEmissive
subsurfaceScattering
skyAndFog
hairCoverage
mainTransDepth
linearizeZ
mainTransparent
halfResUpsample
motionBlurDerive
motionBlurVelocity
motionBlurFilter
filmicEffectsEdge
ansparent
lensScope
filmicEffects
bloom
luminanceAvg
finalPost
overlay
fxaa
smaa
resample
screenEffect
hmdDistortion
cascadedShadowmaps
mainOpaqueEmissive
spriteDof


These are the rendering passes we had in Frostbite a few years ago for Battlefield 4. Our
pipeline now, with PBR, has even more passes and complexity.


WorldRenderer challenges 


» Explicit immediate mode rendering
» Explicit resource management
    > Bespoke, artisanal hand-crafted ESRAM management
    > Multiple implementations by different game teams
» Tight coupling between rendering systems
» Limited extensibility
    > Game teams must fork / diverge to customize
» Organically grew from 4k to 15k SLOC
    > Single functions with over 2k SLOC
    > Expensive to maintain, extend and merge/integrate


World Renderer state as it evolved from 2007 to 2016. 


Modular WorldRenderer goals


» High-level knowledge of the full frame
» Improved extensibility
    > Decoupled and composable code modules
    > Automatic resource management
» Better visualizations and diagnostics


Major World Renderer re-architecture in 2016 to address accumulated technical debt, 
improve extensibility and maintainability. 


Not micro-management of explicit passes and resources 
Not hacking monolithic functions inside engine code 
Not baby-sitting of memory allocation and aliasing. 




New architectural components 


» Frame Graph
    > High-level representation of render passes and resources
    > Full knowledge of the frame
» Transient Resource System
    > Resource allocation
    > Memory aliasing

[Diagram: World Renderer; Features; Frame Graph; Transient Resources; Shading System; Render Context; GFX APIs]


Frame Graph


Frame Graph goals 


> Build high-level knowledge of the entire frame
> Simplify resource management 
> Simplify rendering pipeline configuration 
> Simplify async compute and resource barriers 

> Allow self-contained and 


> Visualize and debug complex rendering pipelines 




Frame Graph example 


[Diagram: Depth pass; Depth Buffer; Gbuffer pass; Gbuffer 1; Gbuffer 2; Gbuffer 3; Lighting; Lighting buffer; Post; Backbuffer; Present]

Render passes and resources for the entire frame expressed as a directed acyclic graph


Toy example of a frame graph that implements a deferred shading pipeline. 
The graph contains render passes and resources as nodes. 
Access declarations / dependencies are edges. 


Graph of a Battlefield 4 frame 




Debug graph visualization using GraphViz. Output is a searchable PDF (not static 
image). 

Graphs can be surprisingly large and complex. 

While it can be useful in some cases, it's definitely not the primary visualization tool
that we ended up using.

We wrote a custom visualization script (HTML + JavaScript and JSON data exported
from runtime).

JSON data contains information about all render passes and resources. 

For each render pass we know which resources were created, read or written. 

For each resource we know its complete memory layout and various metadata 
(debug name, size, format, etc.). 

The visualization is interactive and provides a much more useful overview of what's 
going on in a frame, similar to what you'd find in PIX. 
Frame Graph design


» Moving away from immediate mode rendering
» Rendering code split into passes
» Multi-phase retained mode rendering API
    > Setup phase
    > Compile phase
    > Execute phase
» Built from scratch every frame
» Code-driven architecture


FrameGraph is a step away from immediate mode rendering towards retained. 

We build the graph every frame from scratch, since rendering configuration may 
change dynamically based on player actions, cut-scenes, etc. 

The big assumption is that setup phase is relatively cheap, as we're only dealing with 
a relatively small number of render passes and resources. 
Frame Graph setup phase


» Define render / compute passes
» Define input and output resources for each pass
» Code flow is similar to immediate mode rendering


Flow is similar to IM rendering, but we are not generating any GPU commands during
this phase. Just building up the information about rendering operations for the frame.
All resources are virtual during graph building. Render pass inputs and outputs are
declared using virtual resource handles.

Frame Graph resources


» Render passes must declare all used resources
    > Read
    > Write
    > Create
» External permanent resources are imported to Frame Graph
    > History buffer for TAA
    > Backbuffer
    > etc.


Some render passes may have effects that are not visible to FrameGraph (for example 
data read-back from GPU). Such passes are explicitly marked as having side-effects. 


Some persistent render targets are still required (TAA, SSR, etc.). They can be 
imported into FrameGraph. 


Writing to imported resource counts as side-effect of a render pass, which ensures 
that it is not culled during the compilation phase. 
Frame Graph resource example


RenderPass::RenderPass(FrameGraphBuilder& builder)
{
    FrameGraphTextureDesc desc;
    desc.width = 1280;
    desc.height = 720;
    desc.format = RenderFormat_D32_FLOAT;
    // Initial resource state (clear or discard/undefined)
    desc. = FrameGraphTextureDesc:: ;
    m_renderTarget = builder. (desc);
}


[Diagram: RenderPass; Render Target]


Simple dummy render pass that produces a render target resource.
Very similar to creating a regular texture, except we also specify the initial resource
state (clear or discard/undefined).


Frame Graph setup example 


RenderPass::RenderPass(FrameGraphBuilder& builder,
    FrameGraphResource input,
    FrameGraphMutableResource renderTarget)
{
    m_input = builder. (input, readFlags);
    m_renderTarget = builder. (renderTarget, writeFlags);
}


[Diagram: Input; RenderPass; Render Target (version 1)]


Render pass that reads from one texture and writes to another.
Writing to a texture produces a renamed handle. This allows us to catch errors when
resources are modified in undefined order (when the same resource is written by
different passes).
Renaming resources enforces a specific execution order of the render passes.
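
The renaming rule can be illustrated with a small stand-alone sketch. Everything here is hypothetical (the `Handle` and `VersionedRegistry` names, and the choice to reject stale reads as well as stale writes); it is not the Frostbite API, only one way a handle could carry a version so that a second write through an old handle is detected.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical handle: index of the virtual resource plus the version it refers to.
struct Handle {
    std::uint32_t index;
    std::uint32_t version;
};

// Hypothetical registry that tracks the latest version of every virtual resource.
class VersionedRegistry {
public:
    Handle create() {
        m_latest.push_back(0);
        return Handle{static_cast<std::uint32_t>(m_latest.size() - 1), 0};
    }

    // Reading requires a current handle but does not rename it.
    Handle read(Handle h) const {
        check(h);
        return h;
    }

    // Writing consumes the current version and returns a renamed handle.
    Handle write(Handle h) {
        check(h);
        ++m_latest[h.index];
        return Handle{h.index, m_latest[h.index]};
    }

private:
    void check(Handle h) const {
        if (h.index >= m_latest.size() || m_latest[h.index] != h.version)
            throw std::logic_error("stale resource handle: access order is ambiguous");
    }

    std::vector<std::uint32_t> m_latest;  // latest version per resource
};
```

With this scheme, a pass that wants to write after another writer has to be handed the renamed handle, which is what pins down the order of the two passes.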


Advanced FrameGraph operations 


» Deferred-created resources 
» Declare resource early, allocate on first actual use 
» Automatic resource bind flags, based on usage 
» Derived resource parameters 
> Create render pass output based on input size / format 
> Derive bind flags based on usage 
» MoveSubresource 
» Forward one resource to another 
» Automatically creates sub-resource views / aliases 
> Allows "time travel"


Abstract / virtualized resources allow some convenient tricks. Resources may be
declared early, but their memory will be allocated only on first use. An example use
case is a depth buffer resource. We know that we will need one to do 3D rendering,
but we don't necessarily know (or care) if our rendering pipeline is using a depth
pre-pass. Depth pre-pass, gbuffer pass and forward-shaded geometry passes all simply
write to the resource that they require to be declared early.

FrameGraph resource handles have metadata attached to them that can be queried
during the setup phase.

This allows some render passes to create derived resources. For example, a generic
down-sample render pass can create an output resource that shares all properties of
the input, but overrides width/height. Resource bind flags can also be automatically
computed based on how the resource is used. The pass that creates a render target
resource does not need to know that this resource is going to be used as a UAV, etc.
One of the more magical operations that are possible on virtualized resources is
MoveSubresource. It creates aliases of resources that will be created by future
render passes.
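
As an illustration of derived parameters, the sketch below shows a down-sample pass deriving its output description from the input, and bind flags accumulated from declared usage. The types and names (`TextureDesc`, `BindFlags`, `VirtualTexture`, `deriveHalfResDesc`) are assumptions made for this example and are not taken from the engine.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical texture description; field names are illustrative only.
struct TextureDesc {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t format;  // opaque format identifier
};

// Hypothetical bind flags, accumulated from the usage each pass declares.
enum BindFlags : std::uint32_t {
    BindNone = 0,
    BindRenderTarget = 1u << 0,
    BindShaderResource = 1u << 1,
    BindUnorderedAccess = 1u << 2,
};

// Output of a generic down-sample pass: same properties as the input,
// with halved dimensions (never smaller than 1x1).
inline TextureDesc deriveHalfResDesc(const TextureDesc& input) {
    TextureDesc out = input;
    out.width = std::max<std::uint32_t>(1, input.width / 2);
    out.height = std::max<std::uint32_t>(1, input.height / 2);
    return out;
}

// The creating pass does not choose bind flags; they are the union of all declared usage.
struct VirtualTexture {
    TextureDesc desc;
    std::uint32_t bindFlags = BindNone;

    void declareUsage(BindFlags usage) { bindFlags |= usage; }
};
```

Because the flags are only final once every pass has declared its usage, the concrete GPU resource can be created no earlier than the compile phase.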
MoveSubresource example


[Diagram: Deferred shading module (Depth pass; Depth Buffer; Gbuffer pass; Gbuffer 1; Gbuffer 2; Gbuffer 3; Lighting; Lighting buffer, a 2D Render Target) connected through a Subresource move to the Reflection module (Reflection probe; Convolution; Cubemap)]


A generic rendering pipeline can be implemented that creates an output texture, 
which is a simple 2D image resource. 

It can be combined with a reflection probe filtering pipeline that takes a cubemap 
input. 

The move operation can be used to assign the "Lighting buffer" resource to one of the
cubemap faces.

This causes the deferred shading module to write directly to the cubemap face, 
instead of creating a separate render target. 

The same deferred shading module can be used in a different context, without move 
operation. In this case FrameGraph will allocate a transient render target for the 
output. 




Frame Graph compilation phase 


» Cull unreferenced resources and passes
    > Can be a bit more sloppy during declaration phase
    > Aim to reduce configuration complexity
    > Simplifies conditional passes, debug rendering, etc.
» Calculate resource lifetimes
» Allocate concrete GPU resources based on usage
    > Simple greedy allocation algorithm
    > Acquire right before first use, release after last use
    > Extend lifetimes for async compute
    > Derive resource bind flags based on usage


FrameGraph data structures 


Flat array of used resource handles per RenderPass 
Flat array of RenderPasses in FrameGraph 
Flat array of resources in ResourceRegistry 
Resource handles are just indices into this array 
Compilation phase linearly walks through all RenderPasses 
Computes reference counts for resources 
Computes first and last users for resources 
Computes async wait points and resource barriers 
RenderPass execution order is defined by setup order 
No re-ordering during compilation 
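
The linear walk described above can be sketched with plain index-based arrays. This is a minimal sketch under assumed names (`PassDecl`, `Lifetime`, `computeLifetimes`); barriers and async wait points are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using ResourceHandle = std::uint32_t;  // index into the flat resource array

struct PassDecl {
    std::vector<ResourceHandle> used;  // flat array of handles used by this pass
};

struct Lifetime {
    std::size_t firstUser = std::numeric_limits<std::size_t>::max();
    std::size_t lastUser = 0;
    std::uint32_t refCount = 0;  // number of passes that use the resource
};

// Single linear walk over the passes in setup order (no re-ordering).
inline std::vector<Lifetime> computeLifetimes(const std::vector<PassDecl>& passes,
                                              std::size_t resourceCount) {
    std::vector<Lifetime> lifetimes(resourceCount);
    for (std::size_t passIndex = 0; passIndex < passes.size(); ++passIndex) {
        for (ResourceHandle handle : passes[passIndex].used) {
            Lifetime& lt = lifetimes[handle];
            if (lt.refCount == 0) lt.firstUser = passIndex;  // first pass to touch it
            lt.lastUser = passIndex;                         // latest pass seen so far
            ++lt.refCount;
        }
    }
    return lifetimes;
}
```

The first and last user indices are what a transient allocator would need in order to acquire memory right before first use and release it after last use.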


Culling algorithm

Simple graph flood-fill from unreferenced resources.

Compute initial resource and pass reference counts
    renderPass.refCount++ for every resource write
    resource.refCount++ for every resource read
Identify resources with refCount == 0 and push them on a stack
While stack is non-empty
    Pop a resource and decrement ref count of its producer
    If producer.refCount == 0, decrement ref counts of resources that it reads
        Add them to the stack when their refCount == 0
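
The steps above can be sketched directly in C++. The struct names are hypothetical, and the sketch makes three assumptions that the slides do not spell out: every resource has at most one producer (which renamed handles would provide), passes with side effects get one extra reference so they are never culled, and every pass either writes a resource or is flagged as having side effects.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CullPass {
    std::vector<std::uint32_t> reads;   // resource indices
    std::vector<std::uint32_t> writes;  // resource indices
    bool hasSideEffects = false;        // e.g. writes an imported resource
    std::uint32_t refCount = 0;         // 0 after culling means "culled"
};

struct CullResource {
    std::int32_t producer = -1;  // index of the writing pass, -1 if none
    std::uint32_t refCount = 0;  // number of reads
};

inline void cullGraph(std::vector<CullPass>& passes, std::vector<CullResource>& resources) {
    for (CullResource& r : resources) { r.producer = -1; r.refCount = 0; }

    // Compute initial reference counts.
    for (std::size_t p = 0; p < passes.size(); ++p) {
        CullPass& pass = passes[p];
        pass.refCount = static_cast<std::uint32_t>(pass.writes.size()) +
                        (pass.hasSideEffects ? 1u : 0u);
        for (std::uint32_t r : pass.reads) ++resources[r].refCount;
        for (std::uint32_t w : pass.writes) resources[w].producer = static_cast<std::int32_t>(p);
    }

    // Seed the stack with resources that nobody reads.
    std::vector<std::uint32_t> stack;
    for (std::uint32_t r = 0; r < resources.size(); ++r)
        if (resources[r].refCount == 0) stack.push_back(r);

    // Flood-fill backwards through producers.
    while (!stack.empty()) {
        const std::uint32_t r = stack.back();
        stack.pop_back();
        if (resources[r].producer < 0) continue;
        CullPass& producer = passes[static_cast<std::size_t>(resources[r].producer)];
        if (--producer.refCount == 0) {
            for (std::uint32_t read : producer.reads)
                if (--resources[read].refCount == 0) stack.push_back(read);
        }
    }
}
```

After the call, a pass with `refCount == 0` can be skipped during execution, and a resource with `refCount == 0` does not need memory.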
Sub-graph culling example


[Diagram: Depth pass; Depth Buffer; Gbuffer pass; Gbuffer 1; Gbuffer 2; Gbuffer 3; Lighting; Lighting buffer; Post; Final target; Present; Debug View; Debug output]

Debug output texture is not consumed, therefore it and the render pass are culled


It is sometimes convenient to add render passes and resources to the graph without
checking if they are needed first.

For example, we can always add certain debug visualizations or specialized passes, 
such as depth buffer linearization. 

This cuts down on the rendering pipeline configuration complexity a bit. 


Sub-graph culling example 


[Diagram: Depth pass; Depth Buffer; Gbuffer pass; Gbuffer 1; Gbuffer 2; Gbuffer 3; Lighting; Lighting buffer; Post; Debug View; Debug output; Final target; Present]

Debug visualization is switched on by connecting the debug output to the back buffer node

Lighting and postprocessing parts of the pipeline are automatically disabled


The engine contains many features and deciding whether to execute a certain render 
pass can be a chore. It also introduces some coupling between passes. 


Lighting passes don't need to know anything about the debug output. When debug is
enabled, it overrides the lighting output, which causes the lighting passes to be
culled. This leads to more decoupled / modular code.
Frame Graph execution phase


» Execute callback functions for each render pass
» Immediate mode rendering code
    > Using familiar RenderContext API
    > Set state, resources, shaders
    > Draw, Dispatch
» Get GPU resources from handles generated in setup phase


Execution phase is quite simple. Iterate over render passes that are not culled and
call their execution callback function.
This phase is almost identical to how rendering was done before FrameGraph. Just
use the RenderContext API, except that FrameGraph resources must be de-virtualized first.
Async compute


» Could derive from dependency graph automatically
» Manual control desired
    > Great potential for savings, but...
    > Memory increase
    > Can hurt performance if misused
» Opt-in per render pass
    > Kicked off on main timeline
    > Sync point at first use of output resource on another queue
    > Resource lifetimes automatically extended to sync point


Efficient async compute requires some hand-holding today. While we do have all the 
render pass and resource dependencies in the graph, we don’t know what 
bottlenecks will exist on the GPU during execution. Don’t want bandwidth-heavy 
compute passes to run with bandwidth-heavy graphics work (shouldn’t be news to 
anyone). 


Async operations will increase memory water mark, because resource lifetimes are 
extended (more resources are alive simultaneously). Need to be a bit careful. 


Ended up with a manual opt-in mechanism for render passes. Async passes are kicked 
off on the main timeline at the point where they’d execute serially (we don’t re-order 
passes). 

Synchronization point is automatically added on the main pipeline before the first 
render pass that consumes the output of async pass. 
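
A rough sketch of how such a sync point could be located, assuming passes are stored in setup order and each pass lists the resource indices it reads and writes. The names (`AsyncPass`, `findSyncPoint`, `extendLastUse`) are illustrative assumptions, not engine API.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct AsyncPass {
    std::vector<std::uint32_t> reads;   // resource indices
    std::vector<std::uint32_t> writes;  // resource indices
    bool async = false;                 // manual opt-in per render pass
};

constexpr std::size_t kNoSync = std::numeric_limits<std::size_t>::max();

// Index of the first later main-queue pass that reads an output of the async pass.
inline std::size_t findSyncPoint(const std::vector<AsyncPass>& passes, std::size_t asyncIndex) {
    const AsyncPass& producer = passes[asyncIndex];
    for (std::size_t i = asyncIndex + 1; i < passes.size(); ++i) {
        if (passes[i].async) continue;  // same queue, no cross-queue sync needed
        for (std::uint32_t r : passes[i].reads)
            for (std::uint32_t w : producer.writes)
                if (r == w) return i;
    }
    return kNoSync;
}

// Resources touched by async work must stay alive at least until the sync point.
inline std::size_t extendLastUse(std::size_t lastUse, std::size_t syncPoint) {
    return (syncPoint != kNoSync && syncPoint > lastUse) ? syncPoint : lastUse;
}
```

Extending `lastUse` this way is where the higher memory water mark comes from: more resources are alive at the same time.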
Async compute


[Diagram: Main queue: Depth pass; SSAO; Filter; Shadows; Lighting. Resources: Depth Buffer; Raw AO; Filtered AO]


Example compute-based AO filtering pipeline.
Async compute


[Diagram: Main queue: Depth pass; Shadows; Lighting. Async queue: SSAO; Filter. Sync point before Lighting. Resources: Depth Buffer; Raw AO; Filtered AO]


AO buffer generation and filtering can be moved to async queue, but resource 
lifetimes must be extended a bit (up to the sync point). 
Frame Graph async setup example


AmbientOcclusionPass::AmbientOcclusionPass(FrameGraphBuilder& builder)
{
    builder. (true);
}


This is all you have to do in high-level code. Super simple to answer questions like 
“what would happen if we ran this async?”. 


In the future we'd like to explore automatic render pass re-ordering, perhaps with 
profile-guided optimization step. 
Pass declaration with C++


» Could just make a C++ class per RenderPass
    > Breaks code flow
    > Requires plenty of boilerplate
    > Expensive to port existing code
» Settled on C++ lambdas
    > Preserves code flow!
    > Minimal changes to legacy code
    > Wrap legacy code in a lambda
    > Add resource usage declarations


Programmer convenience is very important. Don’t want to introduce too much 
boilerplate code or break the code flow. 

Started with a C++ class with virtual execute(), but quickly realized that such an
approach requires moving quite a lot of code around.

It also requires a bit of plumbing to pass data between setup and execution phases. 


Implemented a lambda-based API to improve the convenience. This also greatly 
simplified the effort of porting legacy rendering code to the new system. 

Started by simply wrapping huge chunks of code in lambdas. Gradually replaced raw 
resources with transients and sub-divided monolithic lambdas into smaller ones. 
Eventually moved out final small code blocks into stand-alone functions. Spaghetti is
mostly untangled.


The price that we have to pay is a template-heavy FrameGraph setup API. 
Pass declaration with C++ lambdas


FrameGraphResource addMyPass(FrameGraph& frameGraph,
    FrameGraphResource input, FrameGraphMutableResource output)
{
    struct PassData
    {
        FrameGraphResource input;
        FrameGraphMutableResource output;
    };

    auto& renderPass = frameGraph.addCallbackPass<PassData>("MyRenderPass",
        // Setup lambda: runs inline, may capture by reference
        [&](RenderPassBuilder& builder, PassData& data)
        {
            data.input = builder. (input);
            data.output = builder. (output).targetTextures[0];
        },
        // Execute lambda: deferred, must capture by value
        [=](const PassData& data, const RenderPassResources& resources, IRenderContext* renderContext)
        {
            drawTexture2d(renderContext, resources.getTexture(data.input));
        });

    return ;
}


addCallbackPass() is a template function that creates a render pass class behind the
scenes that's parameterized by the PassData and the execution lambda.


Setup lambda is inlined in addMyPass(), but execute lambda is deferred. Setup 
lambda may capture everything by reference, but execute must capture by value. 
Capturing data by value is a little bit dangerous since it’s possible to accidentally 
capture a pointer that’s released before execution phase. It’s also possible to 
accidentally capture huge structures by value. Luckily, we can enforce that the size of 
execution lambda is below a certain size at compile time (we settled on 1KB limit). 
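
The compile-time size check can be sketched as below. `MiniFrameGraph` is a deliberately tiny, hypothetical stand-in: its setup lambda receives only the pass data (no builder), and the execute lambda receives only the pass data (no resources or render context). The point is only to show a `static_assert` on the size of the by-value execute lambda, with the 1KB limit taken from the text above.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kMaxExecuteLambdaSize = 1024;  // 1KB limit

// Hypothetical minimal graph that only stores deferred execute callbacks.
class MiniFrameGraph {
public:
    template <typename PassData, typename Setup, typename Execute>
    const PassData& addCallbackPass(std::string name, Setup&& setup, Execute execute) {
        // Everything the execute lambda captures by value counts towards its size.
        static_assert(sizeof(Execute) <= kMaxExecuteLambdaSize,
                      "execute lambda captures too much data by value");

        auto data = std::make_shared<PassData>();
        setup(*data);  // setup runs immediately, inline in the caller

        m_names.push_back(std::move(name));
        m_execute.push_back([data, execute]() { execute(*data); });  // deferred
        return *data;
    }

    // Execution phase: call every stored callback in setup order.
    void execute() {
        for (auto& fn : m_execute) fn();
    }

private:
    std::vector<std::string> m_names;
    std::vector<std::function<void()>> m_execute;
};
```

A lambda that accidentally captured a large structure by value would fail to compile here instead of silently copying it every frame. The check cannot catch a captured pointer that is released before execution, which remains the caller's responsibility.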
Render modules


» Two types of render modules:
    > Free-standing functions
        > Inputs and outputs are Frame Graph resource handles
        > May create nested render passes
        > Most common module type in Frostbite
    > Persistent render modules
        > May have some persistent resources (LUTs, history buffers, etc.)
» WorldRenderer still orchestrates high-level rendering
    > Does not allocate any GPU resources
    > Just kicks off rendering modules at the high level
    > Much easier to extend
    > Code size reduced from 15K to 5K SLOC
Communication between modules


» Modules may communicate through a blackboard
    > Hash table of components
    > Accessed via component Type ID
    > Allows


::renderBlurPyramid(
    FrameGraph& frameGraph,
    FrameGraphBlackboard& blackboard)
{
    auto& blurData = blackboard. < >();
    addBlurPyramidPass(frameGraph, blurData);
}

#include "BlurModule.h"
void createBlurPyramid(
    FrameGraph& frameGraph,
    const FrameGraphBlackboard& blackboard)
{
    const auto& blurData = blackboard. < >();
    addTonemapPass(frameGraph, blurData);
}


A single global blackboard is not necessarily needed. Modules may create their own
blackboard within their setup scope, propagate some data from the parent into it and
then copy results into the parent blackboard at the end.


While blackboard is great for decoupling, it does make the code harder to 
understand. If a module takes a blackboard as a parameter, it's not possible to tell at 
the call site which resources will actually be accessed. 

The module code itself must be viewed to answer this. 


Invalid blackboard access can only be validated at run-time. 


On balance, we believe that the benefits outweigh the drawbacks. 
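
A blackboard of this kind can be sketched as a hash table keyed by type. This is a minimal illustration with assumed names (`Blackboard`, `add`, `get`), not the Frostbite class. It shows why invalid access can only be caught at run time: the lookup key is a run-time type ID.

```cpp
#include <memory>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <utility>

// Hypothetical blackboard: one component per C++ type, looked up by type ID.
class Blackboard {
public:
    // Producer side: create the component and return a reference for filling it in.
    template <typename T, typename... Args>
    T& add(Args&&... args) {
        auto component = std::make_shared<T>(std::forward<Args>(args)...);
        auto result = m_components.emplace(std::type_index(typeid(T)), component);
        if (!result.second) throw std::logic_error("component already on blackboard");
        return *component;
    }

    // Consumer side: a missing component is only detected at run time.
    template <typename T>
    const T& get() const {
        auto it = m_components.find(std::type_index(typeid(T)));
        if (it == m_components.end()) throw std::out_of_range("component not on blackboard");
        return *static_cast<const T*>(it->second.get());
    }

private:
    std::unordered_map<std::type_index, std::shared_ptr<void>> m_components;
};
```

A producing module would call `add<SomeData>()` during its setup, and a consuming module would call `get<SomeData>()` without knowing which module produced it.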
Transient Resource System


The back-bone of FrameGraph. 




Transient resource system 


Transient /ˈtranzɪənt/ adjective
Lasting only for a short time; impermanent.


» Resources that are alive for no longer than one frame
    > Buffers, depth and color targets, UAVs
    > Strive to minimize resource life times within a frame
» Allocate resources where they are used
    > Directly in leaf rendering systems
    > Deallocate as soon as possible
    > Make it easier to write self-contained features
» Critical component of Frame Graph
Transient resource system back-end


» Implementation depends on platform capabilities
    > Aliasing in physical memory (XB1)
    > Aliasing in virtual memory (DX12, PS4)
    > Object pools (DX11)
» Atomic linear allocator for buffers
    > No aliasing, just blast through memory
    > Mostly used for sending data to GPU
» Memory pools for textures

[Diagram: Efficiency versus Complexity, with DX11 PC and DX12 PC marked]
Transient textures on PlayStation 4


[Diagram: address range over time (Depth pass; SSAO; Gbuffer pass; Lighting; Post) showing Depth Buffer, AO, Gbuffer 1, Gbuffer 2, Gbuffer 3, Lighting buffer and Final output, with waste due to fragmentation]


» Reserve a single large virtual memory pool
» Allocate texture virtual memory block on first use
    > Use a general purpose non-local memory allocator
    > Patch or allocate GNM resource descriptors as needed
» Return virtual memory block after last use
» Commit physical memory to cover VA range used in current frame
    > Grow the physical memory pool on demand
    > Shrink down to the high water mark of last N frames
» Resources overlap in virtual address space
    > Understood natively by PS4 graphics debugging tools (Razor)
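
To make the "allocate on first use, return after last use" idea concrete, here is a sketch of first-fit placement inside one address range. It is a simplified stand-in for a general purpose allocator: it ignores alignment, assumes requests arrive sorted by first use, and all names (`TransientRequest`, `Placement`, `placeGreedy`) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TransientRequest {
    std::uint64_t size;     // bytes (alignment ignored in this sketch)
    std::size_t firstPass;  // block is acquired right before this pass
    std::size_t lastPass;   // block is released after this pass
};

struct Placement {
    std::uint64_t offset;
    std::uint64_t size;
    std::size_t lastPass;
};

// Places requests (assumed sorted by firstPass) and returns the high water mark.
inline std::uint64_t placeGreedy(const std::vector<TransientRequest>& requests,
                                 std::vector<std::uint64_t>& offsets) {
    std::vector<Placement> live;
    std::uint64_t highWaterMark = 0;
    offsets.clear();
    for (const TransientRequest& req : requests) {
        // Return blocks whose last user has already executed.
        live.erase(std::remove_if(live.begin(), live.end(),
                                  [&](const Placement& p) { return p.lastPass < req.firstPass; }),
                   live.end());
        std::sort(live.begin(), live.end(),
                  [](const Placement& a, const Placement& b) { return a.offset < b.offset; });

        // First fit: lowest offset where the new block overlaps no live block.
        std::uint64_t offset = 0;
        for (const Placement& p : live) {
            if (offset + req.size <= p.offset) break;  // gap before p is big enough
            offset = std::max(offset, p.offset + p.size);
        }
        live.push_back(Placement{offset, req.size, req.lastPass});
        offsets.push_back(offset);
        highWaterMark = std::max(highWaterMark, offset + req.size);
    }
    return highWaterMark;
}
```

The returned high water mark corresponds to the amount of the virtual range that would need physical backing for the frame. Resources with disjoint lifetimes end up at overlapping offsets.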
Transient textures on DirectX 12 PC


[Diagram: heaps over time (Depth pass; SSAO; Gbuffer pass; Lighting; Post) showing Depth Buffer, AO, Gbuffer 1, Gbuffer 2, Gbuffer 3, Lighting buffer and Final output. Many small heaps mean fragmented address space.]


A bit similar to PS4, except many disjoint address ranges instead of just one. 

Can't use a single range, as it’s impossible to shrink it without stalling the GPU or 
temporarily increasing memory usage. 

Despite these shortcomings, we're still able to re-use memory sometimes and see a 
significant overall water mark reduction. 


Frostbite does not currently perform global memory allocation optimization, but it 
could theoretically be implemented. A global optimization pass would allow merging 
Heap 2 into Heap 6. This would bring down the overall number of heaps and the 
memory water mark. 


Concrete problems with resource heaps in current D3D12: 


Tier 1 heaps have restrictions on the types of resources that can be placed in them:
only buffers, only textures, or only render targets and depth buffers. Separate heaps
must be created for different resource types. Most transient resources that we alias
are RT or DS, so it's not too bad. We force the RT flag on a transient texture even if
the user did not specifically request it.


Tier 2 heaps are better, as all types of resources can be aliased. They are still not 
ideal, as we must allocate many heaps and sub-allocate within them. This leads to 
more fragmentation compared to allocating from a single large address range. We 
can't allocate a single huge heap, as we can't shrink it. Compromise is to create one 
large-ish persistent transient resource heap and then create smaller overflow heaps. 


Once a resource is created, it can't be moved. This means that if memory allocation 
“schedule” changes a bit, some objects will change their placements and will have to 
be re-created. It's possible to work around this issue to some degree by caching 
placed D3D objects (resources and various views) and re-using them when possible 
(potentially many frames later, when allocation schedule changes again to a 
compatible one). Resource allocation schedule may change based on player actions, 
cut-scenes, UI, etc. However, there is typically only a handful of unique schedules, so
it's possible to use an LRU cache.


These problems simply don't exist on consoles. Tiled resources could be quite
convenient in the future (almost the same level of efficiency as XB1 memory aliasing),
however as of October 2016 there are significant CPU and GPU overheads to using
them as RTVs / DSVs. Additionally, resource heap tier restrictions prevent efficient tile
mapping updates via CopyTileMappings. We sometimes want to use multiple heaps
as page sources to back a single resource. The current UpdateTileMappings API can only
take a single heap pointer, therefore multiple API calls are required.
Transient textures on Xbox One


[Diagram: memory over time (Depth pass; SSAO; Gbuffer pass; Lighting; Post) showing Depth Buffer, AO, Gbuffer 1, Gbuffer 2, Gbuffer 3, Lighting buffer and Final output. Light buffer is disjoint in memory.]


Fragmentation-free dynamic memory allocation and aliasing 
Close to optimal ESRAM utilization automatically 

Don’t need contiguous memory blocks 

Resources may be fully or partially in ESRAM 

Overflow to DRAM when every ESRAM page is in use 
Hand-tune memory allocation based on profiling 

Deny ESRAM for some resources 

Allocate ESRAM top-down or bottom-up 

Restrict ESRAM to % of the resource, place rest in DRAM 
Transient textures on Xbox One


[Diagram: memory pool pages (Page 0 to Page 5) over time (Depth pass; SSAO; Gbuffer pass; Lighting; Post) showing Depth Buffer, AO, Gbuffer 1, Gbuffer 2, Gbuffer 3, Lighting buffer and Final output]


» Use a physical memory pool of ESRAM and DRAM pages
» Allocate all resources at unique virtual addresses
    > VirtualAlloc, CreatePlacedResourceX
» Allocate physical memory pages from pool on first use
    > ESRAM pages first, overflow to DRAM
    > Extend DRAM pool on demand
    > Shrink DRAM pool based on high water mark of last N frames
» Return physical pages to the pool after last use
» Update GPU page table before executing other commands
    > XB1-specific ID3D12CommandQueue API
    > Conceptually similar to CopyTileMappings
    > Page table update happens on GPU timeline
Memory aliasing considerations


» Must be careful
    > Ensure valid resource metadata state (FMASK, CMASK, DCC, etc.)
        > Perform fast clears or discard / over-write resources or disable metadata
    > Ensure resource lifetimes are correct
        > Harder than it sounds
        > Account for compute and graphics pipelining
        > Account for async compute
    > Ensure that physical pages are written to memory before reuse
DiscardResource & Clear


» Must be the first operation on a newly allocated resource
» Requires resource to be in the render target or depth write state
» Initializes resource metadata (HTILE, CMASK, FMASK, DCC, etc.)
    > Similar to performing a fast-clear
    > Resource contents remain undefined (not actually cleared)
» Prefer DiscardResource over Clear when possible
Aliasing barriers


Add synchronization between work on GPU 

Add necessary cache flushes 

Use precise barriers to minimize performance cost 

Can use barriers for difficult cases (but expect IHV tears) 


Since different rendering passes may use the same physical memory, we need to add
a synchronization point between them to make sure that they don't run in parallel and
overwrite each other's memory.

We do this by using aliasing barriers, which will add the necessary pipeline and cache 
flushes. 


Aliasing barrier example 


» Potential aliasing hazard due to pipelined CS and PS work
    > CS and PS use different D3D resources, so transition barriers aren't enough
    > Must flush CS before PS or extend resource lifetimes


Graphics and compute passes don’t overlap logically, but run in parallel because they 
are independent (they don’t have any producer-consumer relationship or any shared 
logical resources). 




Aliasing barrier example 


> Serialized compute work ensures correctness when memory aliasing 
> May hurt performance in some cases 


» Use explicit async compute when overlap is critical for performance 


Using explicit async compute allows us to ensure that resource memory isn’t released 
until compute chain is done. 


In this particular example, overlapped and non-overlapped versions have exactly the 
same performance characteristics. Overlap does not always improve perf. In fact, it 
may sometimes hurt it. 


Transient resource allocation results 
Non-aliasing memory layout (720p)


147 MB total

DirectX 12 PC memory layout (720p)


80 MB total

PlayStation 4 memory layout (720p)


77 MB total

Xbox One memory layout (720p)


76 MB total


32 MB ESRAM
44 MB DRAM

What about 4K?

Non-aliasing memory layout (4K, DX12 PC)


1042 MB total

Aliasing memory layout (4K, DX12 PC)


472 MB total
570 MB saved


… now we finally have space for those 16k^2 eyeball textures!
Conclusion

Summary


» Many benefits from full frame knowledge
    > Huge memory savings from resource aliasing
    > Semi-automatic async compute
    > Simplified rendering pipeline configuration
    > Nice visualization and diagnostic tools
» Graphs are an attractive representation of rendering pipelines
    > Intuitive and familiar concept
    > Similar to CPU job graphs or shader graphs
» Modern C++ features ease the pain of retained mode API


Full frame knowledge and visualization is an awesome tool that allowed us to spot 
inefficiencies in resource allocation, possibilities for async compute, etc. 
Future work


» Global optimization of resource barriers 
» Async compute bookmarks 
> Profile-guided optimization 

> Async compute 

> Memory allocation 

> ESRAM allocation 


We're only starting to scratch the surface of what's possible with the modern
rendering engine architecture and APIs. We expect to see more engines in the future
moving to a similar design (high-level frame setup), since that appears to be the
best way to drive DX12 and Vulkan style renderers.
Questions?


The End 